This is intended as a handy end-to-end cheatsheet for data scientists.
For the most commonly used data manipulation techniques in R, Python, and Hive, see data_manipulation_dict.xlsx.
Here are the most commonly used visualizations:
(Example plot: GOOG stock prices.)
Boxplot lines, from bottom to top:
- Q1 - 1.5 * IQR (lower whisker)
- Q1
- Median
- Q3
- Q3 + 1.5 * IQR (upper whisker)
For an event A where \(P(A) \in [0,1]\),
\[P'(A) = 1 - P(A)\] \[P(A \cup B) = P(A) + P(B) - P(A \cap B)\] \[P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}\] \[P(A \cap B) = P(B|A)P(A) \]
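A quick numeric sanity check of these identities, sketched in Python with made-up probabilities:

```python
# Illustrative check of the basic probability identities above.
# All numbers are arbitrary, chosen only to satisfy the axioms.
p_a = 0.30          # P(A)
p_b = 0.40          # P(B)
p_a_and_b = 0.12    # P(A and B); here A and B happen to be independent

p_not_a = 1 - p_a                       # complement rule
p_a_or_b = p_a + p_b - p_a_and_b        # inclusion-exclusion
p_a_given_b = p_a_and_b / p_b           # conditional probability
p_b_given_a = p_a_and_b / p_a
# Bayes' rule recovers P(A|B) from P(B|A):
p_a_given_b_bayes = p_b_given_a * p_a / p_b
```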
Population mean - \(\mu = \frac{1}{N} * \sum{X}\)
Sample mean - \(\bar{x} = \frac{1}{n} * \sum{x}\)
Population variance - \(\sigma^2 = \frac{1}{N} * \sum{(X-\mu)^2}\)
Sample variance - \(s^2 = \frac{1}{n-1} * \sum{(x-\bar{x})^2}\)
Standard error of the means - \(SEM = s/\sqrt{n}\)
Confidence interval - a range constructed so that, under repeated sampling, the stated percentage of such intervals would contain the true parameter. \(\bar{x} \pm t \cdot SEM\)
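The summary statistics above, sketched in Python with the standard library; the 1.96 critical value assumes a large-sample normal approximation (for small n, a t quantile would be used instead):

```python
import statistics
from math import sqrt

# Summary statistics for a made-up sample.
x = [4.1, 5.0, 3.8, 4.4, 5.2, 4.9, 4.6, 4.0, 5.1, 4.7]
n = len(x)

mean = statistics.mean(x)
s2 = statistics.variance(x)        # sample variance: divides by n - 1
sem = sqrt(s2) / sqrt(n)           # standard error of the mean
# ~95% confidence interval, using z = 1.96 (large-sample approximation).
ci = (mean - 1.96 * sem, mean + 1.96 * sem)
```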
Degrees of freedom - the number of values in a calculation that are free to vary: the number of independent observations minus the number of intermediate parameters estimated.
\[Var(x) = E[(x-E[x])^2]\]
\[cov(X,Y) = E[(X-E[X])(Y-E[Y])]\]
\[corr(X,Y) = \frac{cov(X,Y)}{\sigma_X \sigma_Y}\]
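Covariance and correlation computed directly from these definitions, in a small Python sketch with made-up data:

```python
from math import sqrt

# Population covariance and correlation from the definitions above.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]   # roughly 2x, so corr should be near 1

n = len(x)
mx = sum(x) / n
my = sum(y) / n
# cov(X, Y) = E[(X - E[X])(Y - E[Y])]
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
sy = sqrt(sum((b - my) ** 2 for b in y) / n)
corr = cov / (sx * sy)
```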
Oftentimes with big data, we rely on the assumption that the data approximate a normal distribution. However, it helps to understand which distributions fit which types of data, since some data, such as the number of purchases per customer, clearly do not.
When the number of samples/trials is large enough, the sample mean approaches the population mean (the law of large numbers). For example, in a series of dice rolls:
The sample means of independent and identically distributed samples from any distribution with finite mean and variance converge to a normal distribution as the number of samples increases (the central limit theorem).
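A minimal Python simulation of this convergence using dice rolls (seeded for reproducibility):

```python
import random

# Means of many samples of 100 dice rolls cluster tightly around the
# expected value E[roll] = 3.5, and their histogram looks normal.
random.seed(42)

def sample_mean(n_rolls):
    return sum(random.randint(1, 6) for _ in range(n_rolls)) / n_rolls

means = [sample_mean(100) for _ in range(1000)]
grand_mean = sum(means) / len(means)   # should be close to 3.5
```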
(Example plots: uniform, Poisson, and exponential distributions.)
Testing generally compares a statistic of the form \(\frac{\bar{x} - \mu}{SEM}\) against a reference distribution.
| Goal | Continuous | Discrete | Proportions | Non-parametric |
|---|---|---|---|---|
| One-sample | t-test | | One-proportion z-test | Wilcoxon test |
| Paired | Paired t-test | McNemar's test | | Wilcoxon test |
| Two-sample | Two-sample t-test | Chi-squared test | Two-proportion z-test | Mann-Whitney U |
| Multi-sample | ANOVA | | | Kruskal-Wallis |
Mostly t-test and its variations
\[\frac{\bar{x} - \mu}{s/\sqrt{n}}\]
\[\frac{(\bar{x_1} - \bar{x_2}) - (\mu_1-\mu_2)}{s_p * \sqrt{\frac{1}{m} + \frac{1}{n}}}\]
\[s_p=\sqrt{\frac{(m-1)(s_{x_1})^2 + (n-1)(s_{x_2})^2}{m+n-2}}\]
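The pooled two-sample t statistic above, sketched in Python with illustrative data:

```python
from math import sqrt

# Two-sample pooled t statistic, following the formulas above
# (null hypothesis: mu1 - mu2 = 0).
def pooled_t(x1, x2):
    m, n = len(x1), len(x2)
    mean1 = sum(x1) / m
    mean2 = sum(x2) / n
    s1_sq = sum((v - mean1) ** 2 for v in x1) / (m - 1)
    s2_sq = sum((v - mean2) ** 2 for v in x2) / (n - 1)
    sp = sqrt(((m - 1) * s1_sq + (n - 1) * s2_sq) / (m + n - 2))
    return (mean1 - mean2) / (sp * sqrt(1 / m + 1 / n))

t = pooled_t([5.1, 4.9, 5.3, 5.0, 5.2], [4.6, 4.8, 4.5, 4.7, 4.9])
```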
##
## Paired t-test
##
## data: a and b
## t = 0.98041, df = 99, p-value = 0.3293
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1329513 0.3926566
## sample estimates:
## mean of the differences
## 0.1298527
\[Expected = \frac{total_1}{total_{all}} * \frac{total_2}{total_{all}} * total_{all}\]
\[X^2 = \sum{\frac{(observed - expected)^2}{expected}}\]
\[df = (m-1) * (n-1)\]
where m and n are respective number of levels.
Used for product mix, matrix-wise CTR and other discrete variables in contingency tables.
## clicked not
## 1 10 50
## 2 20 40
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: c
## X-squared = 3.6, df = 1, p-value = 0.05778
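The Yates-corrected statistic can be reproduced by hand from the clicked/not table above; this Python sketch recovers the X-squared = 3.6 value:

```python
# Chi-squared statistic for the 2x2 clicked/not table above,
# with Yates' continuity correction.
table = [[10, 50],
         [20, 40]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)

x2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        # Expected count: (row total / total) * (col total / total) * total
        expected = row_totals[i] * col_totals[j] / total
        x2 += (abs(observed - expected) - 0.5) ** 2 / expected   # Yates
```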
\[X^2 = \frac{(b-c)^2}{b+c}\]
where b and c are the counts of discordant pairs. The statistic follows a chi-squared distribution with one degree of freedom.
Approximately normal given a large enough sample. Used for CTR, conversion, and other proportions.
\[z = \frac{(p-p_0)}{\sqrt{\frac{p_0(1-p_0)}{n}}}\]
\[z = \frac{(p_1-p_2)}{\sqrt{p(1-p) * (\frac{1}{n_1}+\frac{1}{n_2})}}\]
where \(p\) is the pooled proportion across both groups
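A Python sketch of the two-proportion z statistic, using the 32/1000 vs. 54/1000 counts from the example below; it omits the continuity correction, so it differs slightly from R's `prop.test` output:

```python
from math import sqrt

# Two-proportion z statistic without continuity correction.
x1, n1 = 32, 1000
x2, n2 = 54, 1000

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)     # pooled proportion across both groups
z = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
```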
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(32, 54) out of c(1000, 1000)
## X-squared = 5.3583, df = 1, p-value = 0.02062
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.040754721 -0.003245279
## sample estimates:
## prop 1 prop 2
## 0.032 0.054
Null hypothesis: all group central values are the same. Assumes homoskedasticity and normality of the outcomes.
- N - total number of samples
- k - number of groups
- \(n_i\) - number of members of group i
Mean Squares
Total Mean Squares - \(\frac{\sum{(x-\bar{x}_{all})^2}}{N-1}\)
Between-group Mean Squares - \(\frac{\sum{(\bar{x}_{group}-\bar{x}_{all})^2}}{k-1}\), summing over all observations
Within-group Mean Squares - \(\frac{\sum{(x-\bar{x}_{group})^2}}{N-k}\)
\(F = \frac{\text{Between-group Mean Squares}}{\text{Within-group Mean Squares}}\)
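The mean squares and F statistic, computed by hand in Python on a toy three-group dataset:

```python
# One-way ANOVA F statistic from the mean squares above.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]

all_x = [v for g in groups for v in g]
N = len(all_x)
k = len(groups)
grand_mean = sum(all_x) / N

# Between-group sum of squares: each observation contributes its
# group mean's squared deviation from the grand mean.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (N - k)
f_stat = ms_between / ms_within
```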
The probability of finding data at least as extreme as observed, given the null hypothesis.
What most people think of it as:
\(P(Null|Data) = \frac{P(Null) \cdot P(Data|Null)}{P(Data)} = \frac{P(Null) \cdot P(Data|Null)}{P(Null) \cdot P(Data|Null) + P(!Null) \cdot P(Data|!Null)}\)
What it really is:
\(P(Data|Null)\)
Example: 100 samples, 0.1 prevalence, 0.8 power, and 0.05 significance. We will detect 8 of the 10 true positives plus roughly 5 false positives (0.05 × 90 = 4.5). Precision is 8/13 = 61.5%, not 95%.
\(\frac{0.1 \times 0.8}{(0.1 \times 0.8)+(0.05 \times 0.9)} = 0.64\)
The larger \(k\) (the number of tests), the more likely we are to see at least one false positive. The probability of no false positives across \(k\) independent tests at level \(a\) is \((1-a)^k\).
Bonferroni correction: \[a_c = \frac{a}{k}\] where k is the number of experiments.
Šidák correction: \[ a = 1 - (1-a_c)^k \] \[ a_c = 1 - (1-a)^{1/k} \]
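A quick Python check of the family-wise error rate and both corrections (a = 0.05 and k = 10 are illustrative choices):

```python
# Family-wise error rate for k independent tests at level a, and the
# corrected per-test levels from the formulas above.
a = 0.05
k = 10

fwer = 1 - (1 - a) ** k               # chance of >= 1 false positive
a_bonferroni = a / k
a_sidak = 1 - (1 - a) ** (1 / k)
```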
Benjamini-Hochberg procedure:
1. Order the m hypotheses by ascending p-value on a rank-p plane
2. Draw the line \(p = \frac{a}{m} \cdot rank\)
3. Find the largest rank whose p-value still falls below the line, and reject all hypotheses up to that rank

This controls the false discovery rate at level \(a\).
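A minimal Python implementation of the procedure (the p-values are made up):

```python
# Benjamini-Hochberg: reject the hypotheses whose sorted p-values
# fall below the line p = (a / m) * rank.
def benjamini_hochberg(p_values, a=0.05):
    m = len(p_values)
    ranked = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(ranked, start=1):
        if p_values[idx] <= a / m * rank:
            cutoff_rank = rank        # largest rank still under the line
    return {ranked[i] for i in range(cutoff_rank)}   # rejected indices

rejected = benjamini_hochberg([0.01, 0.02, 0.03, 0.04, 0.50])
```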
\[ LR = \frac{TPR}{FPR} = \frac{sensitivity}{1-specificity} \]
\[ \lambda = \frac{\mathcal{L}_{reducedModel}}{\mathcal{L}_{fitModel}} \] where \(\mathcal{L}\) is a model likelihood (distinct from the diagnostic LR above).
\[LL = -2log\lambda\]
AIC penalizes models with more features.
\[AIC = LL + 2k\]
where k is the number of features
BIC is stricter than AIC whenever \(\log(n) > 2\).
\[BIC = LL + klog(n)\]
where n is the number of examples.
Rescale features, either to look like \(N(0, 1)\) (standardization) or to fall in \([0, 1]\) (normalization):
\[ x_{standardized} = \frac{x - \overline{x}}{s_x} \]
\[ x_{normalized} = \frac{x - min(x)}{max(x) - min(x)} \]
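Both scalings in a short Python sketch:

```python
import statistics

# Feature scaling per the two formulas above.
def standardize(x):
    mean = statistics.mean(x)
    s = statistics.stdev(x)          # sample standard deviation
    return [(v - mean) / s for v in x]

def normalize(x):
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

z = standardize([1.0, 2.0, 3.0, 4.0, 5.0])   # mean 0, sd 1
u = normalize([1.0, 2.0, 3.0, 4.0, 5.0])     # mapped onto [0, 1]
```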
We handle imbalanced classes with:
##
## 0 1
## 980 20
##
## 0 1
## 20 20
##
## 0 1
## 980 978
##
## 0 1
## 500 500
Synthetic Minority Over-sampling Technique: the algorithm selects k neighbors (using a distance measure) and perturbs an instance one attribute at a time, by a random amount within the difference to the neighboring instances.
##
## 0 1
## 2000 1020
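A simplified SMOTE-style sketch in Python, reduced to k = 1 (nearest neighbor only); real implementations sample among k neighbors:

```python
import random
from math import dist    # Euclidean distance (Python 3.8+)

# SMOTE-style oversampling, simplified: a synthetic point is placed a
# random fraction of the way from a minority instance towards its
# nearest minority neighbor.
random.seed(0)

def smote_point(minority, idx):
    x = minority[idx]
    neighbor = min((p for i, p in enumerate(minority) if i != idx),
                   key=lambda p: dist(x, p))
    u = random.random()   # random amount within the difference
    return tuple(xi + u * (ni - xi) for xi, ni in zip(x, neighbor))

minority = [(1.0, 1.0), (1.2, 0.9), (2.0, 2.1)]
synthetic = smote_point(minority, 0)
```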
Because tree-based models use information criteria to split instead.
Truncated SVD is used to reduce dimensions.
\[ M = U \Sigma V^T \]
where the dimensions of \(U\), \(\Sigma\), and \(V^T\) are m x m, m x n, and n x n respectively.
1. Standardize the features
2. Compute the covariance matrix \(\frac{1}{m}X^TX\)
3. Perform SVD to find the eigenvectors \(U\) of \(X^TX\): \[ A = U S V^T \]
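A pure-Python sketch of these steps for two mean-centered features (data are made up), using the closed-form eigendecomposition of the 2x2 covariance matrix rather than a full SVD routine:

```python
from math import sqrt

# PCA on two mean-centered features: eigen-decompose the 2x2
# covariance matrix (1/m) X'X in closed form.
x1 = [-1.2, -0.6, 0.0, 0.6, 1.2]
x2 = [-1.1, -0.7, 0.1, 0.5, 1.2]   # strongly correlated with x1
m = len(x1)

a = sum(v * v for v in x1) / m               # cov(x1, x1)
c = sum(v * v for v in x2) / m               # cov(x2, x2)
b = sum(u * v for u, v in zip(x1, x2)) / m   # cov(x1, x2)

# Eigenvalues of [[a, b], [b, c]] give the variance along each PC.
disc = sqrt((a - c) ** 2 + 4 * b ** 2)
lambda1 = (a + c + disc) / 2
lambda2 = (a + c - disc) / 2
explained = lambda1 / (lambda1 + lambda2)    # share of variance on PC1
```

With two near-duplicate features, almost all the variance lands on the first component, which is exactly the dimensionality-reduction argument above.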
## Standard deviations:
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
##
## Rotation:
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
Maximizes the margin between two classes with hyperparameter delta
For a single example \(i\), the loss for a wrong class \(j\) is positive only when its score comes within \(\Delta\) of (or exceeds) the correct class's score: \[L_{i} = \sum_{j \neq k} \max(0, f_{j} - f_{k} + \Delta)\] where \(k\) is the correct class.
Derivative of loss function by weight \[\frac{dL_{i}}{dW} = \frac{dL_{i}}{df_{j-or-k}} * \frac{df_{j-or-k}}{dW} = \frac{dL_{i}}{df_{j-or-k}} * X_{i}\]
Minus one for each wrong class that violates the margin \[\frac{dL_{i}}{df_{k}} = -\sum_{j \neq k} 1(f_{j} - f_{k} + \Delta > 0)\]
Plus one for each violating wrong class \[\frac{dL_{i}}{df_{j}} = 1(f_{j} - f_{k} + \Delta > 0)\]
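The hinge loss in a short Python sketch:

```python
# Multiclass hinge (SVM) loss for one example, per the formula above.
def hinge_loss(scores, correct, delta=1.0):
    return sum(max(0.0, s - scores[correct] + delta)
               for j, s in enumerate(scores) if j != correct)

# Wrong class 1 outscores the correct class 0, so the loss is positive.
loss = hinge_loss([1.0, 2.0, 0.5], correct=0)
```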
For an example: \[L_{i} = -log(p_{k})\] \[p_{k} = \frac{e^{f_{k}}}{\Sigma e^{f_{j}}}\] \[f = X_{i}W + b\] where \(i\) is the example, \(k\) the correct class, and \(j\) any class.
For regularization term: \[l2 = \frac{\lambda}{2}W^2\] \[l1 = \frac{\lambda}{2}|W|\]
Overall Loss: \[L = \frac{\Sigma L_{i}}{N} + l1/l2\]
Derivative of loss function by weight \[\frac{dL_{i}}{dW} = \frac{dL_{i}}{df_{k}} * \frac{df_{k}}{dW} = \frac{dL_{i}}{df_{k}} * X_{i}\] \[\frac{dL_{i}}{df_{k}} = \frac{-dlog(p_{k})}{df_{k}} = \frac{-1}{p_{k}}\frac{dp_{k}}{df_{k}} = \frac{-1}{p_{k}}\frac{d(\frac{e^{f_{k}}}{\Sigma e^{f_{j}}})}{df_{k}}\] \[\frac{d(\frac{e^{f_{k}}}{\Sigma e^{f_{j}}})}{df_{k}} = \frac{\Sigma e^{f_{j}}e^{f_{k}} - e^{2f_{k}}}{(\Sigma e^{f_{j}})^2} = p_{k} - p_{k}^2 = p_{k}(1-p_{k})\] \[\frac{dL_{i}}{df_{k}} = p_{k} - 1(y_{i}=k)\] \[\frac{dL_{i}}{dW} = [p_{k} - 1(y_{i}=k)] * X_{i}\]
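The loss and the \(p_{j} - 1(y_{i}=j)\) gradient in a short Python sketch (with the usual max-subtraction trick for numerical stability):

```python
from math import exp

# Softmax probabilities and the gradient of the cross-entropy loss
# with respect to the scores, matching dL/df_j = p_j - 1(y = j).
def softmax(scores):
    mx = max(scores)                      # subtract max for stability
    exps = [exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grad_scores(scores, correct):
    p = softmax(scores)
    return [p_j - (1.0 if j == correct else 0.0) for j, p_j in enumerate(p)]

g = grad_scores([0.0, 0.0], correct=0)    # p = [0.5, 0.5]
```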
\[L_i = \Sigma (y - f_j)^2 \]
\[L_i = \Sigma |y - f_j| \]
\[\frac{dL_{i}}{dW} = \frac{dL_{i}}{df_j} * \frac{df_j}{dW} = \frac{dL_{i}}{df_j} * X_{i}\] \[\frac{dL_{i}}{dW} = -2(y-f_j) * X_i\]
Let \(p_i\) be the proportion of examples labelled class \(i\).
The higher the impurity, the less homogeneous the node.
\[I = \Sigma p_i(1-p_i)\]
The difference between the entropy of the parent and the weighted entropy of the children
\[IG = H(T) - H(T|a)\]
\[H(T) = - \Sigma p_i \log_2 p_i\]
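Both impurity measures in a short Python sketch:

```python
from math import log2

# Gini impurity and entropy from class proportions, as defined above.
def gini(p):
    return sum(p_i * (1 - p_i) for p_i in p)

def entropy(p):
    return -sum(p_i * log2(p_i) for p_i in p if p_i > 0)

# A 50/50 node is maximally impure for two classes.
g = gini([0.5, 0.5])
h = entropy([0.5, 0.5])
```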
l2 loss reduction is used for splits in regression trees.
\[cosineSimilarity = \frac{A \cdot B}{||A||_{2}||B||_{2}} = \frac{\Sigma A_i B_i}{\sqrt{\Sigma A_i^2 \Sigma B_i^2}}\]
\[JaccardSimilarity = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}\]
\[Manhattan(l1) = \Sigma|A_i-B_i|\]
\[Euclidean(l2) =\sqrt{\Sigma(A_i-B_i)^2}\]
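All four measures in a short Python sketch:

```python
from math import sqrt

# Cosine and Jaccard similarity, plus l1/l2 distances, as defined above.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def jaccard(a, b):           # a and b are sets
    return len(a & b) / len(a | b)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```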
Plot training size vs. error, or polynomial degree vs. error, for both the validation and training sets.
The implementation can be divided into the following:
Notable features:
- Assumes equal variances within groups
- For convex, isotropic clusters
- Requires standardization
Builds the hierarchy from the individual elements by progressively merging clusters. It minimizes:
- Ward: sum of squared differences
- Complete linkage: maximum distance
- Average linkage: average distance
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
BLUE - best linear unbiased estimator
iid - independent and identically distributed
Perform splits on each node according to information measures.
Nearest k points take a vote according to distance
- Small k: fast, more efficient, more strict
- Large k: biased towards popular labels, ignores outliers
## Support Vector Machines with Radial Basis Function Kernel
##
## 1000 samples
## 6 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared
## 0.25 0.4628845 0.8889003
## 0.50 0.4638732 0.8868725
## 1.00 0.4716239 0.8825424
##
## Tuning parameter 'sigma' was held constant at a value of 0.04650196
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.04650196 and C = 0.25.
Multi-class generalization of logistic regression.
##
## Call:
## glm(formula = vs ~ wt + gear, family = "binomial", data = mtcars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6425 -0.7830 -0.1126 0.5729 1.7588
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.9370 5.4987 1.989 0.0467 *
## wt -2.5077 0.9866 -2.542 0.0110 *
## gear -0.9044 0.8124 -1.113 0.2656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.860 on 31 degrees of freedom
## Residual deviance: 29.944 on 29 degrees of freedom
## AIC: 35.944
##
## Number of Fisher Scoring iterations: 5
\[P(spam|word) = \frac{P(word|spam) * P(spam)}{P(word)}\] \[P(spam|word)= \frac{P(word1, word2,word3|spam) * P(spam)}{P(word)}\] \[P(spam|word)= \frac{P(word1|spam) * P(word2,word3|spam,word1) * P(spam)}{P(word)}\] It’s naive because it assumes that chances of words appearing are independent \[P(spam|word) = \frac{P(word1|spam) * P(word2|spam)* P(word3|spam) * P(spam)}{P(word)}\]
## Naive Bayes
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 150, 150, 150, 150, 150, 150, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.9525666 0.9281677
## TRUE 0.9491480 0.9229907
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
## and adjust = 1.
Matrix multiplication based on similarity matrix or embedded vectors.
Fit \(h(x)\), the model improvement term, to the residuals of the previous model.
\(h(x) = y - F_m\)
\(F_{m+1} = F_m + v * h(x)\)
v = shrinkage (learning rate)
Prediction procedure:
\[logOdds_{j} = log(\frac{prob_{positive}}{prob_{negative}})\] \[logOdds_{i} = \Sigma logOdds_{j}\] \[P(NC) = \frac{e^{logOdds_{i}}}{1+e^{logOdds_{i}}}\] where i is an example and j is a transition within the example.
ReLU activation: \(\max(0,f_j)\)
A convolution is a correlation filter rotated 180 degrees.
Max pooling reduces resolution by taking the maximum value in each window.
Has an additive memory cell to prevent vanishing gradients. Let \(s_t\) be the score.
Forget gate: how much of old information to retain
\[f_t = \sigma_{sigmoid}(s_t + U_fh_{t-1})\]
Information gate: how much of new information to receive
\[i_t = \sigma_{sigmoid}(s_t + U_ih_{t-1})\]
Output gate
\[o_t = \sigma_{sigmoid}(s_t + U_oh_{t-1})\]
This is where the magic happens.
\[c_t = f_t \circ c_{t-1} + i_t \circ \sigma_{tanh}(s_t + U_ch_{t-1})\]
\[h_t = o_t \circ \sigma_{tanh}(c_t)\]
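A scalar, single-step Python sketch of the cell; the weights are made-up numbers and biases are omitted, matching the simplified equations above:

```python
from math import exp, tanh

# One step of a scalar LSTM cell; U_* weights are illustrative values.
def sigmoid(x):
    return 1 / (1 + exp(-x))

def lstm_step(s_t, h_prev, c_prev, U_f=0.5, U_i=0.5, U_o=0.5, U_c=0.5):
    f_t = sigmoid(s_t + U_f * h_prev)    # forget gate
    i_t = sigmoid(s_t + U_i * h_prev)    # information (input) gate
    o_t = sigmoid(s_t + U_o * h_prev)    # output gate
    # Additive memory cell: old contents scaled by f_t plus new candidate.
    c_t = f_t * c_prev + i_t * tanh(s_t + U_c * h_prev)
    h_t = o_t * tanh(c_t)
    return h_t, c_t

h, c = lstm_step(s_t=1.0, h_prev=0.0, c_prev=0.0)
```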
Dropout: randomly drop some units during training.
Batch normalization: normalize activations to \(N(0, 1)\).
Embedding: map integers into a continuous vector space.
Momentum: \[\frac{dL_{i}}{dW} = \alpha \cdot \frac{dL_{i}}{dW} + \beta \cdot \frac{dL_{i-1}}{dW}\]
AdaGrad: \[learn_{\beta} = \frac{learn_{base}}{\sqrt{\Sigma (\frac{dL}{d\beta})^2}}\]
\[learn_{\beta} = \frac{learn_{base}}{\gamma * rms_{current} + \nu * rms_{previous}}\]
\[ r_{t} = \frac{|loss_{t} - loss_{t-1}|}{\max(loss_{t},loss_{t-1})}\] \[ d_t = \gamma \cdot r_{t} + \nu \cdot r_{t-1} \]
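The update rules above, sketched in Python on the 1-D quadratic loss \(L(w) = w^2\): plain SGD, momentum (blending current and past gradients), and an AdaGrad-style accumulated-squared-gradient learning rate:

```python
from math import sqrt

# Minimizing L(w) = w^2 (gradient: 2w) with three update rules.
def optimize(steps=100, lr=0.1):
    w_sgd = w_mom = w_ada = 5.0
    velocity = 0.0
    grad_sq_sum = 1e-8                 # AdaGrad accumulator (epsilon seed)
    for _ in range(steps):
        # Plain SGD.
        w_sgd -= lr * 2 * w_sgd
        # Momentum: blend the current gradient into a running velocity.
        velocity = 0.9 * velocity + 2 * w_mom
        w_mom -= lr * velocity
        # AdaGrad-style: the per-parameter rate shrinks with the
        # accumulated squared gradients.
        g = 2 * w_ada
        grad_sq_sum += g * g
        w_ada -= lr / sqrt(grad_sq_sum) * g
    return w_sgd, w_mom, w_ada

w_sgd, w_mom, w_ada = optimize()       # all three head towards w = 0
```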